HUFF
A dynamic Huffman cruncher/decruncher
Version 0.62
Copyright 1992 by M.Zimmermann
License/Disclaimer
------------------
This library may be freely distributed with the XPK compression package,
as long as it is kept in its original, complete, and unmodified form. It
may not be distributed by itself or in a commercial package of any kind
without my permission.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.
About Huffmans
--------------
The idea of a Huffman crunch is as follows: often-used bytes (i.e. 8-bit
codes) get a code shorter than 8 bits, for instance 5 bits. So every time
one of these bytes occurs in the source file I save (in this example)
3 bits in the destination file. To find the optimum code for each byte
there is a simple method: a tree is constructed so that the path from the
root to each leaf reflects the optimum (i.e. Huffman) code. Unfortunately
most computers (the Amiga, too) are byte-oriented, which means rather
complex handling of codes that are not a multiple of 8 bits. This results
in pretty slow compression & decompression. So the xpkHUFF.library
probably won't be used under normal circumstances, but, as Dominik
stated, it may serve well as an example library.
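To make the byte-orientation problem concrete, here is a minimal sketch
of packing variable-length codes into a byte stream. It is not taken
from xpkHUFF.library; the names (BitWriter, putbits, flushbits) are
invented for illustration:

#include <stdio.h>

/* Collects codes of any bit length and flushes complete bytes. */
typedef struct {
    unsigned long acc;          /* bit accumulator             */
    int used;                   /* number of valid bits in acc */
    FILE *f;                    /* destination stream          */
} BitWriter;

static void putbits(BitWriter *bw, unsigned code, int len)
{
    bw->acc = (bw->acc << len) | (code & ((1UL << len) - 1));
    bw->used += len;
    while (bw->used >= 8) {     /* flush complete bytes, MSB first */
        bw->used -= 8;
        fputc((int)((bw->acc >> bw->used) & 0xFF), bw->f);
    }
}

static void flushbits(BitWriter *bw)
{
    if (bw->used > 0)           /* pad the last byte with zeros */
        fputc((int)((bw->acc << (8 - bw->used)) & 0xFF), bw->f);
    bw->used = 0;
}

Every emitted code needs this kind of shifting and masking, which is
exactly where the speed on a byte-oriented CPU is lost.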
There are three different Huffman crunch algorithms known:
· static compression/decompression
· dynamic compression/decompression
· adaptive compression/decompression
What are the differences?
The static Huffman uses a fixed table to represent each byte of the
source. This, of course, makes sense only if the structure of the data
to be crunched is known in advance. In that case (for instance when
crunching an English text) a fixed table of codes is embedded in the
program. Crunching data other than what the table was generated for will
probably result in worse compression or even expansion.
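A minimal sketch of the static idea (the scheme is invented; real
tables are derived from representative sample data): give frequent
bytes short codes and everything else long ones, hard-coded into the
cruncher:

#include <ctype.h>

/* Invented fixed scheme: lower-case letters and space are assumed
   frequent (6 bits), everything else rare (10 bits).  Since
   27*2^-6 + 229*2^-10 <= 1, a prefix code with these lengths
   exists.                                                        */
static int fixed_len(int c)
{
    return (islower(c) || c == ' ') ? 6 : 10;
}

/* Bit count a block would crunch to with this fixed scheme. */
static unsigned long crunched_bits(const unsigned char *buf,
                                   unsigned long size)
{
    unsigned long bits = 0, i;

    for (i = 0; i < size; i++)
        bits += fixed_len(buf[i]);
    return bits;
}

On anything but English-like text crunched_bits() easily exceeds 8 bits
per byte on average, i.e. the data expands.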
This is what a dynamic Huffman avoids: it first creates a statistics
table reflecting the frequency with which every byte occurs, and
generates a special tree/table for this case, so the dynamic Huffman
compresses this particular data well.
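As a minimal sketch of that statistics pass (invented for illustration,
not the actual library source): count how often each byte occurs, then
repeatedly merge the two least frequent subtrees; the depth of each
byte's leaf is the length of its code:

#include <string.h>

#define SYMS 256

/* Derive Huffman code lengths for one block.  The O(n^2) search
   for the two smallest roots keeps the sketch short; a real
   implementation would use a sorted list or a heap.  (A block
   with only one distinct byte would need a special case.)       */
static void huff_lengths(const unsigned char *buf, unsigned long size,
                         int len[SYMS])
{
    unsigned long freq[2 * SYMS];   /* leaf and node frequencies */
    int parent[2 * SYMS];
    unsigned long j;
    int nodes, i, a, b;

    memset(freq, 0, sizeof(freq));
    for (j = 0; j < size; j++)      /* the statistics pass */
        freq[buf[j]]++;
    for (i = 0; i < 2 * SYMS; i++)
        parent[i] = -1;

    nodes = SYMS;
    for (;;) {
        a = b = -1;                 /* find the two smallest roots */
        for (i = 0; i < nodes; i++) {
            if (freq[i] == 0 || parent[i] != -1)
                continue;
            if (a < 0 || freq[i] < freq[a])      { b = a; a = i; }
            else if (b < 0 || freq[i] < freq[b]) { b = i; }
        }
        if (b < 0)                  /* only one root left: done */
            break;
        freq[nodes] = freq[a] + freq[b];    /* merge them */
        parent[a] = parent[b] = nodes;
        nodes++;
    }

    for (i = 0; i < SYMS; i++) {    /* code length = depth of leaf */
        int d = 0, n = i;
        while (parent[n] != -1) { n = parent[n]; d++; }
        len[i] = freq[i] ? d : 0;   /* 0 = byte does not occur */
    }
}

The resulting lengths (or a tree built from them) must be stored with
the crunched block; that table is the overhead a dynamic Huffman pays
and an adaptive one avoids.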
But there is still something that can be improved: imagine a data block
which contains many 'a's in its first part and many 'z's in the last
part... The dynamic Huffman would represent the 'a's and 'z's with short
codes, of course. But it would probably be even better if the
crunch/decrunch tree reflected the *current* data being processed
instead of the whole block, thus resulting in shorter codes for 'a' or
'z', depending on the position in the data block. This is what an
adaptive Huffman does: it creates the tree while crunching, without
doing any statistics or tree creation beforehand. It permanently updates
its internal Huffman tree, and therefore it doesn't have to include the
information about the decrunch tree in the crunched data.
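The following sketch is *not* the classic adaptive (FGK/Vitter)
algorithm and not part of this package; it only illustrates the
principle of statistics that follow the current data: keep running
counts and age them, so recent bytes dominate the codes that would be
derived from them (e.g. with huff_lengths() above):

#define CHUNK 4096   /* arbitrary aging interval (assumed) */

static unsigned long afreq[256];
static unsigned long seen;

/* Called for every byte while crunching.  The decruncher calls it
   for every byte it has already decoded, so both sides keep
   identical statistics and no table has to be stored.            */
static void adapt(unsigned char c)
{
    int i;

    afreq[c]++;
    if (++seen == CHUNK) {      /* age the older statistics */
        for (i = 0; i < 256; i++)
            afreq[i] >>= 1;     /* halve every count */
        seen = 0;
    }
}

The price is that the codes have to be recomputed on the fly, which
costs additional time on a byte-oriented CPU.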
Final words about Huffmans ...
------------------------------
A stand-alone Huffman will never achieve crunch results as good as
those reached with most other crunchers, for these not only consider
the number of occurrences of each byte (as Huffman does) but sequences
of bytes, too. This means: if you create all permutations of a data
block, the Huffman-crunched results will always have the same length
('abab' crunches exactly as well as 'aabb'). Others won't, as they also
depend on the order of the data.
Description
-----------
The library 'xpkHUFF.library' implements a dynamic Huffman crunch
algorithm, even though an adaptive one might give slightly better
crunch results. However, that is more complex to implement, and since
I'm using a maximum buffer size of 64K, every 64K chunk gets its own
tree, so this is already a little like an adaptive Huffman for large
files.

If I should have lots of spare time I will probably implement an
adaptive Huffman crunch algorithm. This new library will be called
xpkHUFF, too, and new xpkHUFF.libraries will always handle output
generated by earlier versions.
The xpkHUFF.library supports a totally unsafe (but a little bit better
than simple eor :-) encryption. Please note that crunch/decrunch speeds
decrease when encryption is used.
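Purely to illustrate what "a little bit better than simple eor" can
mean (this is an invented scrambler, not the cipher xpkHUFF.library
actually uses): instead of XORing every byte with the same value, a
rolling keystream can be derived from the password, so repeated
plaintext bytes encrypt differently:

/* Rolling keystream XOR.  Self-inverse: calling it a second time
   with the same key restores the data.  Still totally unsafe.    */
static void scramble(unsigned char *buf, unsigned long size,
                     const unsigned char *key, int keylen)
{
    unsigned char state = 0x5A;     /* arbitrary start value */
    unsigned long i;

    for (i = 0; i < size; i++) {
        state = (unsigned char)(state * 5 + key[i % keylen] + 13);
        buf[i] ^= state;            /* the same call decrypts */
    }
}

Even such a trivial keystream costs a few extra operations per byte,
which is consistent with the speed penalty noted above.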
The source ...
--------------
The source is provided, so you can have a look at the different
decrunch codes. Please note: the only code I tested intensively is the
byte/cache decrunch code, which was used to create the library included
in this package. As it's the fastest code, you probably won't want to
use another one. However, the others might help you understand what
Huffmans are about.

If you have never written a Huffman, have a look at it; a Huffman code
is a pretty good idea. If you *have* written a Huffman code and it's
*faster* than mine, please let me know and I will update
xpkHUFF.library.
Implementation
--------------
If you see an error message saying "output buffer too small" while
crunching *and* encrypting, this means you tried to crunch and encrypt
a file that, crunched and encrypted, would be larger than the original
file. This should occur only with very small files (for I have a
minimum file size due to the tables) or with files that have been
crunched already and would therefore expand during crunching.
A technical note: this could also happen if the last chunk of a file to
be crunched/encrypted were dimensioned too small by xpkmaster.library.
However, in this case you cannot encrypt the file.

The remainder of this file is a compression-loop fragment that arrived
truncated; it is completed below into a self-contained routine. The
declarations and loop framing are reconstructed assumptions, while the
emit logic is the original code. Note that it is an
RLE/sliding-dictionary packer, not the Huffman coder described above.

#define HASH_SIZE 4096                      /* table size (assumed) */

static unsigned char *hash_tbl[HASH_SIZE];
static const unsigned hash_len = HASH_SIZE - 1;     /* hash mask */
static unsigned cnt_srle, cnt_lrle, cnt_hashes;
static unsigned cnt_spat, cnt_lpat, cnt_nocompr;

unsigned crunch(unsigned char *inbuff, unsigned insize,
                unsigned char *outbuff)
{
    unsigned char *in_idx = inbuff, *inbuff_end = inbuff + insize;
    unsigned char *anchor, *pat_idx, *out_idx = outbuff;
    unsigned short *ctrl_idx, ctrl_bits = 0;
    unsigned ctrl_cnt = 0, cnt, gap, hash, i;
    unsigned char c;

    for (i = 0; i < HASH_SIZE; i++)         /* empty dictionary */
        hash_tbl[i] = inbuff;

    ctrl_idx = (unsigned short *)out_idx;   /* reserve 1st control word */
    out_idx += 2;

    while (in_idx < inbuff_end)
    {
        if (ctrl_cnt == 16)                 /* control word full: flush */
        {
            *ctrl_idx = ctrl_bits;
            ctrl_cnt = 0;
            ctrl_idx = (unsigned short *)out_idx;
            out_idx += 2;
        }
        ++ctrl_cnt;
        anchor = in_idx;

        /* scan for a run of identical characters */
        c = *in_idx++;
        while (in_idx < inbuff_end && *in_idx == c
               && (in_idx - anchor) < 4114)
            in_idx++;

        /* store run if it's longer than 2 characters */
        if ((cnt = in_idx - anchor) > 2)
        {
            if (cnt <= 18)                  /* short rle */
            {
                ++cnt_srle;
                *out_idx++ = cnt - 3;
                *out_idx++ = c;
            }
            else                            /* long rle */
            {
                ++cnt_lrle;
                cnt -= 19;
                *out_idx++ = 16 + (cnt & 0x0F);
                *out_idx++ = cnt >> 4;
                *out_idx++ = c;
            }
            ctrl_bits = (ctrl_bits << 1) | 1;
            continue;
        }
        /* look for a pattern if 2 or more characters remain in the
           input buffer */
        in_idx = anchor;
        if ((inbuff_end - in_idx) > 2)
        {
            /* locate the offset of a possible pattern in the sliding
               dictionary */
            ++cnt_hashes;
            hash = ((((in_idx[0] & 15) << 8) | in_idx[1])
                   ^ ((in_idx[0] >> 4) | (in_idx[2] << 4))) & hash_len;
            pat_idx = hash_tbl[hash];
            hash_tbl[hash] = in_idx;

            /* compare characters if we're within 4098 bytes */
            if ((gap = in_idx - pat_idx) <= 4098)
            {
                while (in_idx < inbuff_end && pat_idx < anchor
                       && *pat_idx == *in_idx && (in_idx - anchor) < 271)
                {
                    in_idx++;
                    pat_idx++;
                }
                /* store pattern if it's more than 2 characters */
                if ((cnt = in_idx - anchor) > 2)
                {
                    gap -= 3;
                    if (cnt <= 15)          /* short pattern */
                    {
                        ++cnt_spat;
                        *out_idx++ = (cnt << 4) + (gap & 0x0F);
                        *out_idx++ = gap >> 4;
                    }
                    else                    /* long pattern */
                    {
                        ++cnt_lpat;
                        *out_idx++ = 32 + (gap & 0x0F);
                        *out_idx++ = gap >> 4;
                        *out_idx++ = cnt - 16;
                    }
                    ctrl_bits = (ctrl_bits << 1) | 1;
                    continue;
                }
            }
        }

        /* can't compress this character, so copy it to outbuff */
        ++cnt_nocompr;
        *out_idx++ = c;
        in_idx = ++anchor;
        ctrl_bits <<= 1;
    }
    /* save last load of control bits */
    ctrl_bits <<= (16 - ctrl_cnt);
    *ctrl_idx = ctrl_bits;

    /* and return the size of the compressed buffer */
    return (unsigned)(out_idx - outbuff);
}
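A hypothetical call of the routine above (buffer sizes and names are
assumptions; the output buffer has headroom because incompressible data
costs one control bit per literal byte, i.e. up to 9/8 of the input):

#include <stdio.h>

int main(void)
{
    static unsigned char in[65536];
    static unsigned char out[65536 + 65536 / 8 + 16];
    size_t n = fread(in, 1, sizeof(in), stdin);
    unsigned packed = crunch(in, (unsigned)n, out);

    fprintf(stderr, "%lu -> %u bytes\n", (unsigned long)n, packed);
    fwrite(out, 1, packed, stdout);
    return 0;
}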